Prediction of Martensite Start Temperature in Steels Using Cross Validation

Mounika Chevva, Hooman Sabarou (Advisor: Dr. Samantha Seals)

2024-11-17

Introduction

Cross-validation Overview

  • A statistical technique for evaluating the performance and generalizability of machine learning models.

  • Divides dataset into training and validation subsets.

  • Ensures model training on one subset and validation on another.

Advantages:

  • Provides more reliable estimates of model performance.

  • Reduces bias compared to a single train-test split.

  • Improves model generalizability by leveraging different training and validation data.

Methods

  • K-Fold Cross-Validation: The dataset is split into k folds; the model is trained on k-1 folds and validated on the remaining fold, and the process is repeated k times so that each fold serves as the validation set once (Kohavi 1995).

  • Leave-One-Out Cross-Validation (LOOCV): A special case of K-Fold in which k equals the number of observations, so each observation serves as the validation set exactly once.

  • Nested Cross-Validation: Used for model selection and hyperparameter tuning. An outer loop estimates performance, while an inner loop, run on each outer training set, handles training and hyperparameter optimization.

Model Measures of Error (MOE)

  • Definition: Measures of Error (MOE) quantify the difference between predicted values and actual outcomes, helping assess model performance.

  • Mean Absolute Error (MAE): The average absolute difference between the actual and predicted values, in the same units as the target variable.

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

where y_i is the actual value, \hat{y}_i is the predicted value, and n is the total number of observations.

  • Root Mean Squared Error (RMSE): The square root of the mean squared error (MSE), providing error in the same units as the target variable (Chai and Draxler 2014).

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

where y_i, \hat{y}_i, and n are defined as for the MAE.

  • Coefficient of Determination (R^2): The proportion of variance in the dependent variable that can be explained by the independent variables (Draper and Smith 1998).

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

where \bar{y} is the mean of the actual values.
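The three measures can be computed directly from their definitions. The sketch below is illustrative only: the slides report that the analysis was done in R, so this Python version, its function names, and the four made-up Ms values are assumptions for demonstration, not the authors' code or data.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical Ms values in degrees Celsius (illustrative, not from the dataset)
actual = [300.0, 350.0, 400.0, 450.0]
predicted = [310.0, 340.0, 420.0, 440.0]

print(mae(actual, predicted), rmse(actual, predicted), r_squared(actual, predicted))
```

Because RMSE squares the residuals before averaging, the single 20-degree miss pulls it above the MAE; the two measures coincide only when all absolute errors are equal.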

Cross Validation Methods

K-Fold Cross-Validation

  1. Divide the dataset D into K equally sized subsets (folds).

  2. For each fold k (where k = 1, 2, \dots, K):

    • Train the model M on the other K - 1 folds and validate it on the k-th fold.
    • Calculate the performance metric P_k (e.g., MAE, RMSE) on the k-th fold.
  3. Average the performance metric over all K folds:

\text{CV}(M) = \frac{1}{K} \sum_{k=1}^{K} P_k
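The K-fold procedure can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the authors' R code: the one-predictor least-squares helper `fit_line`, the fold-dealing scheme, and the synthetic carbon/Ms data are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def k_fold_mae(xs, ys, k=5, seed=0):
    """CV(M) = (1/K) * sum of P_k, where P_k is the MAE on held-out fold k."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [abs(ys[i] - (a + b * xs[i])) for i in held_out]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k, scores

# Synthetic data: a made-up linear trend plus noise (not the steel dataset)
rng = random.Random(1)
carbon = [0.1 + 0.02 * i for i in range(40)]
ms = [750 - 255 * c + rng.gauss(0, 10) for c in carbon]

cv_mae, fold_maes = k_fold_mae(carbon, ms, k=5)
print(round(cv_mae, 2), [round(s, 2) for s in fold_maes])
```

Shuffling before dealing the folds matters when the rows are ordered, as they are here by carbon content; otherwise each held-out fold would cover a range the model never saw in training.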

Leave-One-Out Cross-Validation (LOOCV)

  1. Divide the dataset D into n subsets of one observation each (where n is the number of observations).

  2. For each observation i (where i = 1, 2, \dots, n):

    • Train the model M on the remaining n - 1 observations.
    • Validate the model on the i-th observation.
    • Calculate the performance metric P_i (e.g., absolute or squared error) on the i-th observation.
  3. Average the performance metric over all n observations:

\text{LOOCV}(M) = \frac{1}{n} \sum_{i=1}^{n} P_i
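A minimal LOOCV sketch follows, again in Python rather than the R used for the study. The helper `fit_line` and the six data points are hypothetical. Since each held-out set contains a single observation, the sketch collects the n held-out errors and summarizes them at the end; a per-observation R^2 is not defined, so only MAE and RMSE are computed here.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def loocv(xs, ys):
    """Leave each observation out once; summarize the n held-out errors."""
    n = len(xs)
    errors = []
    for i in range(n):
        # Train on everything except observation i
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Validate on observation i only
        errors.append(ys[i] - (a + b * xs[i]))
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse, errors

# Made-up values for illustration (x: carbon in wt%, y: Ms in degrees Celsius)
x = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
y = [720.0, 700.0, 668.0, 650.0, 622.0, 598.0]

loo_mae, loo_rmse, loo_errors = loocv(x, y)
print(round(loo_mae, 2), round(loo_rmse, 2))
```

The loop refits the model n times, which is why LOOCV becomes expensive for slower learners on datasets with more than a thousand observations.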

Nested Cross-Validation

  1. Outer Loop:

    • Divide the dataset D into K outer folds.
    • For each outer fold k (where k = 1, 2, \dots, K):
      • Reserve the k-th fold as the validation set.
      • Use the remaining K - 1 folds as the training set.
  2. Inner Loop:

    • For each training set from the outer loop and each candidate hyperparameter setting:
      • Divide the training set into J inner folds.
      • For each inner fold j (where j = 1, 2, \dots, J):
        • Train the model M on the other J - 1 inner folds.
        • Validate it on the j-th inner fold.
        • Calculate the performance metric P_{kj} on the j-th inner fold.
  3. Performance Metrics:

    • Average the inner-loop performance metrics within each outer fold k; this average is used to choose among hyperparameter settings:

    \text{Inner CV}_k(M) = \frac{1}{J} \sum_{j=1}^{J} P_{kj}

    • Refit the selected model on the full outer training set and compute the performance metric P_k on the reserved k-th fold. The overall performance metric is the average over all outer folds:

    \text{Nested CV}(M) = \frac{1}{K} \sum_{k=1}^{K} P_k
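The sketch below shows one common formulation of nested cross-validation, in which the inner loop picks a hyperparameter and the reserved outer fold scores the resulting model. It is an assumption-laden illustration, not the authors' R implementation: the nearest-neighbour regressor, the candidate grid, the fold counts, and the synthetic data are all hypothetical choices made to keep the example self-contained.

```python
import random

def knn_predict(train_x, train_y, x0, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x0."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def mae_on(train, test, xs, ys, n_neighbors):
    """MAE on the `test` indices for a model built from the `train` indices."""
    tx, ty = [xs[i] for i in train], [ys[i] for i in train]
    errs = [abs(ys[i] - knn_predict(tx, ty, xs[i], n_neighbors)) for i in test]
    return sum(errs) / len(errs)

def split(indices, k):
    """Deal a list of indices into k folds."""
    return [indices[i::k] for i in range(k)]

def nested_cv(xs, ys, grid, outer_k=5, inner_k=3, seed=0):
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    outer_scores, chosen = [], []
    for outer_test in split(idx, outer_k):
        held = set(outer_test)
        outer_train = [i for i in idx if i not in held]

        def inner_cv(n_neighbors):
            # Inner loop: mean MAE over the inner folds for one candidate
            scores = []
            for fold in split(outer_train, inner_k):
                fset = set(fold)
                inner_train = [i for i in outer_train if i not in fset]
                scores.append(mae_on(inner_train, fold, xs, ys, n_neighbors))
            return sum(scores) / inner_k

        best = min(grid, key=inner_cv)
        chosen.append(best)
        # Outer loop: score the selected setting on the reserved fold
        outer_scores.append(mae_on(outer_train, outer_test, xs, ys, best))
    return sum(outer_scores) / outer_k, outer_scores, chosen

# Synthetic data for illustration only (not the steel dataset)
rng = random.Random(42)
x = [i / 10 for i in range(60)]
y = [700 - 40 * v + 15 * (v % 1.0) + rng.gauss(0, 5) for v in x]

score, fold_scores, picks = nested_cv(x, y, grid=[1, 3, 5, 9])
print(round(score, 2), picks)
```

The reserved outer fold plays no part in choosing the hyperparameter, which is what keeps the outer estimate from being optimistically biased by the tuning step.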

Introduction to the Dataset

Martensite Start Temperature

  • Materials science dataset on steels
  • Martensite start temperature (Ms, in degrees Celsius) and chemical composition (elements in weight percent)
  • Ms changes with the chemistry of the steel
  • Ms matters because it controls the strength of the steel
  • The data contain 16 variables and 1543 observations

Application

Data Exploration and Visualization

In our study, we analyzed the martensite dataset from Wentzien et al. (2024), which supports predicting the Martensite Start Temperature (Ms) of steel alloys from their chemical compositions.

  • Martensite start temperature (Ms) is the target variable.
  • C, Mn, Si, Cr, and Ni are the predictor variables.

Figure: Correlation matrix

Modeling and Results

Linear Regression Model

Linear regression is a fundamental statistical technique that establishes a relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.

In our dataset, which focuses on predicting the Martensite Start Temperature (Ms) of steel based on its chemical composition (C, Mn, Si, Cr, Ni), linear regression allows us to quantify how changes in these elements influence Ms.

M_s = \beta_0 +\beta_1 C +\beta_2 Mn + \beta_3 Si + \beta_4 Cr + \beta_5 Ni

M_s = 746.99 - 254.86 C - 24.24 Mn - 13.28 Si - 7.83 Cr - 14.64 Ni

Linear Regression Model Coefficients

Term          Estimate     Std. Error   t value      p value
(Intercept)   746.99268    4.0289613    185.405771   0.0000000
C            -254.85890    5.7347802    -44.440919   0.0000000
Mn            -24.24356    2.5175491     -9.629826   0.0000000
Si            -13.28195    3.6933099     -3.596218   0.0003357
Cr             -7.82620    0.7366216    -10.624451   0.0000000
Ni            -14.64102    0.2895086    -50.571976   0.0000000
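To see what the fitted equation implies for a particular steel, the estimates from the coefficient table (rounded to two decimals) can be applied to a composition. The composition below is hypothetical and chosen only for illustration, and the Python helper `predict_ms` is an assumed name, not part of the authors' R workflow.

```python
# Coefficient estimates from the table, rounded to two decimals
# (Ms in degrees Celsius, elements in weight percent)
COEF = {
    "Intercept": 746.99,
    "C": -254.86,
    "Mn": -24.24,
    "Si": -13.28,
    "Cr": -7.83,
    "Ni": -14.64,
}

def predict_ms(composition):
    """Predicted Ms for a composition in wt%; elements not listed count as 0."""
    ms = COEF["Intercept"]
    for element, beta in COEF.items():
        if element != "Intercept":
            ms += beta * composition.get(element, 0.0)
    return ms

# Hypothetical steel composition (wt%), not taken from the dataset
steel = {"C": 0.40, "Mn": 0.80, "Si": 0.30, "Cr": 1.00, "Ni": 1.50}
ms_hat = predict_ms(steel)

# Effect of adding 0.10 wt% carbon with everything else held fixed
richer = dict(steel, C=steel["C"] + 0.10)
delta = predict_ms(richer) - ms_hat

print(round(ms_hat, 2), round(delta, 2))
```

Under this model, carbon has by far the largest effect per weight percent: an extra 0.10 wt% C lowers the predicted Ms by about 25 degrees Celsius, regardless of the other elements, because the model has no interaction terms.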

Statistics

Residual standard error: 54.28 on 1230 degrees of freedom

Multiple R-squared: 0.7433

Adjusted R-squared: 0.7422

F-statistic: 712.2 on 5 and 1230 DF, p-value: < 2.2e-16
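The reported summary statistics are tied together by standard formulas, which gives a quick consistency check. The sketch below, in Python and with assumed function names, recomputes the adjusted R-squared and the F-statistic from the reported R-squared and degrees of freedom. Because the inputs are rounded, agreement is expected only to within rounding.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

def f_statistic(r2, n_predictors, residual_df):
    """Overall F = (R^2 / p) / ((1 - R^2) / residual df)."""
    return (r2 / n_predictors) / ((1 - r2) / residual_df)

r2 = 0.7433          # reported multiple R-squared
residual_df = 1230   # reported residual degrees of freedom
n_predictors = 5     # C, Mn, Si, Cr, Ni

# Observations used in the fit: residual df + predictors + intercept
n_obs = residual_df + n_predictors + 1

adj = adjusted_r2(r2, n_obs, n_predictors)
f_stat = f_statistic(r2, n_predictors, residual_df)
print(n_obs, adj, f_stat)
```

The degrees of freedom imply that 1236 observations entered the fit, fewer than the 1543 in the dataset. The slides do not say why; a training subset or rows dropped for missing values would both be consistent with it.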

Cross Validation Results for Linear Regression

5-Fold Cross-Validation (5-Fold CV) and Leave-One-Out Cross-Validation (LOOCV) report the same error for the linear regression model (RMSE 48.27, MAE 32.28, R² 0.81). Nested CV gives a more conservative estimate (RMSE 53.28, MAE 33.46, R² 0.75). These are three estimates of the same model's predictive accuracy rather than three different models.

Measure of Error   Value
RMSE (°C)          48.27
MAE (°C)           32.28
R²                 0.81

K-Fold Cross-Validation

Measure of Error   Value
RMSE (°C)          48.27
MAE (°C)           32.28
R²                 0.81

Leave-One-Out Cross-Validation (LOOCV)

Measure of Error   Value
RMSE (°C)          53.28
MAE (°C)           33.46
R²                 0.75

Nested Cross-Validation

Support Vector Machines (SVM) for Regression (SVR)

Support Vector Machines (SVM) model data by finding optimal boundaries and handle nonlinear patterns through kernels; applied to regression, the method is known as Support Vector Regression (SVR). On our dataset, an SVM with a radial kernel predicts the martensite start temperature (Ms) from the chemical elements C, Mn, Ni, Si, and Cr. The radial kernel lets the model capture nonlinear relationships between these elements and Ms that a single linear equation cannot represent.
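Two ingredients of SVR with a radial kernel are easy to show in isolation: the kernel, which measures how similar two compositions are, and the epsilon-insensitive loss, which ignores errors smaller than a tolerance. The Python sketch below is illustrative only. The slides do not report the hyperparameters used, so the `gamma` and `epsilon` values, like the two compositions, are hypothetical.

```python
import math

def rbf_kernel(u, v, gamma):
    """Radial (Gaussian) kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def epsilon_insensitive(y, y_hat, epsilon):
    """SVR loss: zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y - y_hat) - epsilon)

# Hypothetical compositions in wt%, ordered as (C, Mn, Si, Cr, Ni)
steel_a = (0.40, 0.80, 0.30, 1.00, 1.50)
steel_b = (0.45, 0.80, 0.30, 1.00, 2.00)

gamma = 0.5      # assumed kernel width, not a value from the study
epsilon = 10.0   # assumed tube half-width in degrees Celsius

similarity = rbf_kernel(steel_a, steel_b, gamma)
small_miss = epsilon_insensitive(300.0, 305.0, epsilon)   # inside the tube
large_miss = epsilon_insensitive(300.0, 320.0, epsilon)   # outside the tube
print(similarity, small_miss, large_miss)
```

The kernel equals 1 for identical compositions and decays toward 0 as they move apart, so predictions are driven mostly by training steels with similar chemistry.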

Measure of Error   Value
RMSE (°C)          35.93
MAE (°C)           20.98
R²                 0.90

SVM: K-Fold Cross-Validation

Measure of Error   Value
RMSE (°C)          52.61
MAE (°C)           28.49
R²                 0.79

SVM: Leave-One-Out Cross-Validation (LOOCV)

Measure of Error   Value
RMSE (°C)          40.09
MAE (°C)           22.24
R²                 0.86

SVM: Nested Cross-Validation

Model Comparison Results

In this study, we compared the performance of Linear Regression and Support Vector Machine (SVM) models in predicting the martensite start temperature (Ms). Under 5-fold cross-validation, SVM outperformed Linear Regression, achieving a lower MAE (~21 vs ~32), a higher R² (~0.90 vs ~0.81), and a lower RMSE (~36 vs ~48). Nested CV shows the same ordering, whereas under LOOCV SVM has a lower MAE but a higher RMSE and a slightly lower R².

Method      Measure of Error   Linear Regression   SVM
5-Fold      RMSE               48.27               35.93
5-Fold      MAE                32.28               20.98
5-Fold      R²                 0.81                0.90
LOOCV       RMSE               48.27               52.61
LOOCV       MAE                32.28               28.49
LOOCV       R²                 0.81                0.79
Nested CV   RMSE               53.28               40.09
Nested CV   MAE                33.46               22.24
Nested CV   R²                 0.75                0.86

Model Comparison Results
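The RMSE rows of the comparison table can be expressed as a relative change, which makes the size of the gap easier to read than the raw values. The short Python sketch below uses only the numbers in the table; the variable names are illustrative.

```python
# RMSE values from the comparison table: (Linear Regression, SVM)
rmse = {
    "5-Fold": (48.27, 35.93),
    "LOOCV": (48.27, 52.61),
    "Nested CV": (53.28, 40.09),
}

# Percent reduction in RMSE when moving from Linear Regression to SVM;
# a negative value means SVM had the larger RMSE under that scheme
reduction = {
    method: 100 * (linear - svm) / linear
    for method, (linear, svm) in rmse.items()
}

for method, pct in reduction.items():
    print(f"{method}: {pct:.1f}%")
```

On these figures SVM reduces RMSE by roughly a quarter under 5-fold and nested CV, while under LOOCV its RMSE is about 9% higher than that of Linear Regression.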

Model Comparison Plot

Figure: Model comparison results

Methodology-Hooman

  • Modeling Approach:
    • Untransformed Model: Directly modeled Ms using predictors like C, Mn, Ni, Si, Cr, with interaction terms.

    • Log-Transformed Model: Modeled log(Ms) to handle non-normality and stabilize variance, using the same predictors and interaction terms.

    • Model Improvements: removing predictors, introducing interaction terms, and removing outliers.

    • Model Diagnostics: ANOVA, AIC, cross-validation, a multicollinearity check, and removal of influential points.

    • Model Evaluation: The log-transformed model showed significantly better performance with a lower AIC and cross-validation MSE. Residual deviance and cross-validation confirmed that the log model generalized better to unseen data.

  • Cross-Validation Refinement:
    • K-Fold Cross-Validation with More Folds
    • Leave-One-Out Cross-Validation (LOOCV)
  • Programming was done in R (R Core Team 2021) using RStudio (version 2024.04.2).
  • Packages used: tidyverse (Wickham et al. 2019), classpackage (Buker and Seals 2024), ggplot2 (Wickham 2016), psych (Revelle 2024), and boot (Canty and Ripley 2024; Davison and Hinkley 1997).

Models

  • First Model:

Ms = 769.41 - 286.71 C - 16.42 Mn - 14.04 Ni - 13.89 Si - 10.13 Cr - 41.45 C:Mn - 8.36 C:Ni

Variables   Mean ± SD    Coefficient Estimate   P-value
C           0.36 ± 0.1   -286.71                < 2e-16
Mn          0.79 ± 0.3   -16.42                 1.36E-13
Ni          1.55 ± 0.5   -14.04                 < 2e-16
Si          0.35 ± 0.2   -13.89                 1.70E-13
Cr          1.04 ± 0.7   -10.13                 < 2e-16
C:Mn        N/A          -41.45                 < 2e-16
C:Ni        N/A          -8.36                  9.68E-10
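With interaction terms in the model, the effect of carbon is no longer a single number: it depends on the Mn and Ni content. The Python sketch below evaluates the First Model for a hypothetical composition and computes the carbon slope at that composition. The function names and the composition are assumptions for illustration; the coefficients are the ones listed above.

```python
# First Model coefficients (Ms in degrees Celsius, elements in wt%)
INTERCEPT = 769.41
MAIN = {"C": -286.71, "Mn": -16.42, "Ni": -14.04, "Si": -13.89, "Cr": -10.13}
C_MN = -41.45   # C:Mn interaction
C_NI = -8.36    # C:Ni interaction

def predict_ms(c, mn, ni, si, cr):
    """Main effects plus the two carbon interaction terms."""
    main = (INTERCEPT + MAIN["C"] * c + MAIN["Mn"] * mn + MAIN["Ni"] * ni
            + MAIN["Si"] * si + MAIN["Cr"] * cr)
    return main + C_MN * c * mn + C_NI * c * ni

def carbon_slope(mn, ni):
    """Change in predicted Ms per wt% C at fixed Mn and Ni."""
    return MAIN["C"] + C_MN * mn + C_NI * ni

# Hypothetical composition (wt%), not taken from the dataset
ms_hat = predict_ms(c=0.40, mn=0.80, ni=1.50, si=0.30, cr=1.00)
slope = carbon_slope(mn=0.80, ni=1.50)
print(round(ms_hat, 3), round(slope, 2))
```

At this composition the carbon slope is about -332 degrees Celsius per wt%, steeper than the main-effect coefficient of -286.71, because both interaction coefficients are negative.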

Models

  • Second Model:

log(Ms) = 6.69 - 0.51 C - 0.032 Mn - 0.0255 Ni - 0.0226 Si - 0.0175 Cr - 0.0751 C:Mn - 0.0154 C:Ni

Variables   Mean ± SD    Coefficient Estimate   P-value
C           0.36 ± 0.1   -0.51                  < 2e-16
Mn          0.79 ± 0.3   -0.032                 < 2e-16
Ni          1.55 ± 0.5   -0.0255                < 2e-16
Si          0.35 ± 0.2   -0.0226                4.48E-13
Cr          1.04 ± 0.7   -0.0175                < 2e-16
C:Mn        N/A          -0.0751                < 2e-16
C:Ni        N/A          -0.0154                1.01E-11
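Predictions from the log-transformed model have to be exponentiated to return to degrees Celsius. The Python sketch below does this for a hypothetical composition using the coefficients in the table. It assumes a positive intercept of 6.69, since a negative intercept would imply Ms values close to zero; that sign, the function name, and the composition are assumptions of this example rather than confirmed details of the authors' fit.

```python
import math

# Second Model coefficients on the log(Ms) scale
INTERCEPT = 6.69   # assumed positive; see the note above
MAIN = {"C": -0.51, "Mn": -0.032, "Ni": -0.0255, "Si": -0.0226, "Cr": -0.0175}
C_MN = -0.0751     # C:Mn interaction
C_NI = -0.0154     # C:Ni interaction

def predict_log_ms(c, mn, ni, si, cr):
    """Linear predictor for log(Ms), including both interaction terms."""
    main = (INTERCEPT + MAIN["C"] * c + MAIN["Mn"] * mn + MAIN["Ni"] * ni
            + MAIN["Si"] * si + MAIN["Cr"] * cr)
    return main + C_MN * c * mn + C_NI * c * ni

# Hypothetical composition (wt%), not taken from the dataset
log_ms = predict_log_ms(c=0.40, mn=0.80, ni=1.50, si=0.30, cr=1.00)

# Back-transform to degrees Celsius
ms_hat = math.exp(log_ms)
print(round(log_ms, 4), round(ms_hat, 1))
```

On the log scale each coefficient acts multiplicatively: a coefficient of -0.51 for carbon means each additional 0.10 wt% C scales the predicted Ms by about exp(-0.051), before the interaction terms are taken into account.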

Model Performance Summary

Model          Cross-Validation   RMSE    MAE     R²
First Model    5-Fold             27.79   20.43   0.90
Second Model   5-Fold             0.05    18.28   0.91
First Model    LOOCV              27.80   20.43   0.90
Second Model   LOOCV              0.05    22.02   0.91

Note: the Second Model's RMSE is on the log(Ms) scale.
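An RMSE of 0.05 for the log model is not in degrees Celsius, so it cannot be read against the First Model's 27.79 directly. One rough way to interpret it, sketched below in Python, is as a multiplicative error: a deviation of 0.05 in log(Ms) corresponds to a factor of exp(0.05) in Ms. The reference temperature of 400 degrees Celsius is a hypothetical value chosen only to give a sense of scale.

```python
import math

log_rmse = 0.05   # reported RMSE of the Second Model, on the log(Ms) scale

# A +/- 0.05 deviation in log(Ms) scales Ms by these factors
upper_factor = math.exp(log_rmse)
lower_factor = math.exp(-log_rmse)

# Approximate size in degrees Celsius at a hypothetical Ms of 400
reference_ms = 400.0
approx_error = reference_ms * (upper_factor - 1)

print(round(upper_factor, 4), round(lower_factor, 4), round(approx_error, 1))
```

Read this way, the log model's typical error is about 5% of Ms, roughly 20 degrees Celsius at the reference temperature, which is a rough indication only and not a back-transformed RMSE.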

Summary

Model Choice:

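  • The Log Model slightly outperforms the Regression Model, with a higher R^2 (0.91 vs 0.90) and a lower 5-Fold MAE (18.28 vs 20.43), indicating a better fit and prediction accuracy. Its RMSE (0.05) is reported on the logarithmic scale and is therefore not directly comparable with the Regression Model's RMSE (27.79). For tasks requiring high precision and smaller errors, the log model is preferable.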

Cross-Validation Choice:

  • 5-Fold: Offers slightly lower MAE and RMSE, indicating better stability when training on subsets of the data.

  • LOOCV: Is slightly more sensitive to data variations but confirms consistent results with 5-Fold.

Conclusion:

  • Use the Log Model for predictions.
  • Rely on 5-Fold Cross-Validation for performance evaluation due to its computational efficiency and similar results to LOOCV.

Conclusion: Overview

Evaluation of Two Models:

  • Linear Regression Model
  • Support Vector Machine Model

Cross-validation Methods Used:

  • k-fold Cross-validation
  • Leave-one-out Cross-validation (LOOCV)
  • Nested Cross-validation

Conclusion: Key Findings

  • Mean Absolute Error (MAE): Under 5-fold cross-validation, SVM had a lower MAE (~21) than Linear Regression (~32), meaning SVM’s predictions were closer to the actual values on average.

  • R-squared (R²): SVM showed a higher R² (~0.90) than Linear Regression (~0.81), indicating that SVM explained about 90% of the variability in Ms, while Linear Regression accounted for about 81%.

  • Root Mean Squared Error (RMSE): SVM had a lower RMSE (~36) than Linear Regression (~48), which reflects fewer large prediction errors.

  • Overall Performance: Under 5-fold and nested cross-validation, SVM outperformed Linear Regression on all three metrics; under LOOCV it had a lower MAE but a higher RMSE and a slightly lower R². On balance, SVM was the more accurate model for predicting martensite start temperature (Ms).

References

Buker, Ihsan, and Samantha Seals. 2024. classpackage: Functions for Intro Statistics Courses at the University of West Florida. https://github.com/ieb2/classpackage.
Canty, Angelo, and B. D. Ripley. 2024. boot: Bootstrap R (S-Plus) Functions.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge: Cambridge University Press. doi:10.1017/CBO9780511802843.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Revelle, William. 2024. psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois: Northwestern University. https://CRAN.R-project.org/package=psych.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.